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Abstract 

A mesh is a graph that divides physical space into regularly-shaped regions. Meshes computations 
form the basis of many applications, including finite-element methods, image rendering, collision detec- 
tion, and N-body simulations. In one important mesh primitive, called a mesh update, each mesh vertex 
stores a value and repeatedly updates this value based on the values stored in all neighboring vertices. 
The performance of a mesh update depends on the layout of the mesh in memory. Informally, if the 
mesh layout has good data locality (most edges connect a pair of nodes that are stored near each other in 
memory), then a mesh update runs quickly. 

This paper shows how to find a memory layout that guarantees that the mesh update has asymptoti- 
cally optimal memory performance for any set of memory parameters. Specifically, the cost of the mesh 
update is roughly the cost of a sequential memory scan. Such a memory layout is called cache-oblivious . 
Formally, for a cZ-dimensional mesh G, block size B, and cache size M (where M = Q.(B d )), the mesh up- 
date of G uses 0(1 + \G\ /B) memory transfers. The paper also shows how the mesh-update performance 
degrades for smaller caches, where M = o(B d ). 

The paper then gives two algorithms for finding cache-oblivious mesh layouts. The first layout 
algorithm runs in time 0(\G\ log 2 |G|) both in expectation and with high probability on a RAM. It uses 
G(l + |G| log 2 (|G| /M) /B) memory transfers in expectation and 0(1 + (\G\ /fi)(log 2 (|G| /M) +log |G|)) 
memory transfers with high probability in the cache-oblivious and disk-access machine (DAM) models. 
The layout is obtained by finding a fully balanced decomposition tree of G and then performing an 
in-order traversal of the leaves of the tree. 

The second algorithm computes a cache-oblivious layout on a RAM in time <9(|G| log |G|loglog |G|) 
both in expectation and with high probability. In the DAM and cache-oblivious models, the second lay- 
out algorithm uses 0(1 + (|G|/B)log(|G|/M)min{loglog |G|,log(|G| /M)}) memory transfers in expec- 
tation and 0(l + (|G|/B)(log(|G|/M)min{loglog|G[,log(|G|/M)} + log|G|)) memory transfers with 
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high probability. The algorithm is based on a new type of decomposition tree, here called a relax- 
balanced decomposition tree. Again, the layout is obtained by performing an in-order traversal of the 
leaves of the decomposition tree. 

1 Introduction 

A mesh is a graph that represents a division of physical space into regions, called simplices. Simplices 
are typically triangular (in 2D) or tetrahedral (in 3D). They are well shaped, which informally means that 
they cannot be long and skinny, but must be roughly the same size in any direction. Meshes form the basis 
of many computations such as finite-element methods, image rendering, collision detection, and N-body 
simulations. Constant-dimension meshes have nodes of constant-degree. 

In one important mesh primitive, each mesh vertex stores a value and repeatedly updates this value based 
on the values stored in all neighboring vertices. Thus, we view the mesh as a weighted graph G = (V,E,w,e) 
(w : V — > R, e : E — > R + ). For each vertex i € V, we repeatedly recompute its weight w; as follows: 

Wj = £ Wj e u . 

We call this primitive a mesh update. Expressed differently, a mesh update is the sparse matrix-vector 
multiplication, where the matrix is the (weighted) adjacency matrix of G, and vectors are the vertex weights. 

On a random access machine (RAM) (a flat memory model), a mesh update runs in linear time, regard- 
less of how the data is laid out in memory. In contrast, on a modern computer with a hierarchical memory, 
how the mesh is laid out in memory can affect the speed of the computation substantially. This paper studies 
the mesh layout problem, which is how to lay out a mesh in memory, so that mesh updates run rapidly on a 
hierarchical memory. 

We analyze the mesh layout problem in the disk-access machine (DAM) model [2] (also known as 
the I/O-model) and in the cache-oblivious (CO) model [17]. The DAM model is an idealized two-level 
memory hierarchy. These two levels could represent L2 cache and main memory, main memory and disk, 
or any other pair of levels. The small level (herein called cache) has size M, and the large level (herein 
called disk) has unbounded size. Data is transferred between the two levels in blocks of size B; we call 
these memory transfers. Thus, a memory transfer is a cache-miss if the DAM represents L2 cache and main 
memory and is a page fault, if the DAM represents main memory and disk. 

A memory transfer has unit cost. The objective is to minimize the number of memory transfers. Focusing 
on memory transfers, to the exclusion of other computation, frequently provides a good model of the running 
time of an algorithm on a modern computer. The cache-oblivious model is essentially the DAM model, 
except that the values of B and M are unknown to the algorithm or the coder. The main idea of cache- 
obliviousness is this: If an algorithm performs an asymptotically optimal number of memory transfers on a 
DAM, but the algorithm is not parameterized by B and M, then the algorithm also performs an asymptotically 
optimal number of memory transfers on an arbitrary unknown, multilevel memory hierarchy. 

The cost of a mesh update in the DAM and cache-oblivious models depends on how the mesh is laid out 
in memory. An update to a mesh G = (V,E) is just a graph traversal. If we store G's vertices arbitrarily in 
memory, then the update could cost as much as 0(\V\ + \E\) = 0(\G\) memory transfers, one transfer for 
each vertex and each edge. In this paper we achieve only 0(1 + \G\/B) memory transfers. This is the cost 
of a sequential scan of a chunk of memory of size 0(|G|), which is asymptotically optimal. 

Our mesh layout algorithms extend earlier ideas from VLSI theory. Classical VLSI-layout algorithms 
turn out to have direct application in scientific and I/O-efficient computing. Although these diverse areas 
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may appear unrelated, there are important parallels. For example, in a good mesh layout, vertices are stored 
in (one-dimensional) memory locations so that most mesh edges are short; in a good VLSI layout, graph 
vertices are assigned to (two-dimensional) chip locations so that most edges are short (to cover minimal 
area). 

Results 

We give two algorithms for laying out a constant-dimension well-shaped mesh G = (V,E) so that updates 
run in 0(1 + |G| /B) memory transfers, which is 0(1 + |V| /B) since the mesh has constant degree. 

Our first layout algorithm runs in time 0(|G|log 2 \G\) on a RAM both in expectation and with high 
probability^ In the DAM and cache-oblivious models, the algorithm uses 0(1 + (|G|/B)log 2 (\G\/M)) 
memory transfers in expectation and 0(1 + (|G|/fi)(log 2 (\G\/M) +log |G|)) memory transfers with high 
probability. The layout algorithm is based on decomposition trees and fully balanced decomposition trees [7, 
24] ; specifically, our mesh layout is obtained by performing an in-order traversal of the leaves of a fully- 
balanced decomposition tree. Decomposition trees were developed several decades ago as a framework for 
VLSI layout [7,24], but they are well suited for mesh layout. However, the original algorithm for building 
fully-balanced decomposition trees is too slow for our uses (it appears to run in time O(|G| ^), where b is 
the degree bound of the mesh). Here we develop a new algorithm that is faster and simpler. 

Our second layout algorithm, this paper's main result, runs in time 0(\G\ log \G\ log log \ G\) on a RAM 
both in expectation and with high probability. In the DAM and cache-oblivious models, the algorithm 
uses 0(1 + (|G| /fi)log(|G| /M)min{loglog |G|,log(|G| /M)}) memory transfers in expectation and 0(1 + 
(|G| /S)(log(|G| /M)min{loglog |G|,log(|G| /M)} + log |G|)) memory transfers with high probability. 

The algorithm is based on a new type of decomposition tree, which we call a relax-balanced decom- 
position tree. As before, our mesh layout is obtained by performing an in-order traversal of the leaves of a 
relax-balanced decomposition tree. By carefully relaxing the requirements of decomposition trees, we can 
retain asymptotically optimal mesh updates, while improving construction by nearly a logarithmic factor. 

The mesh-update guarantees require a tall-cache assumption on the memory system that M = Q.(B d ), 
where d is the dimension of the mesh. We also show how the performance degrades for small caches, 
where M = o{B li ). If the cache only has size 0(5 rf ~ e ), then the number of memory transfers increases to 
0(\ + \G\/B l - £ / d ). 

In addition to the main results listed above, this paper has contributions extending beyond I/O-efficient 
computing. First, our algorithms for building fully-balanced decomposition trees are faster and simpler than 
previously known algorithms. Second, our relax-balanced decomposition trees may permit some existing 
algorithms based on decomposition trees to run more quickly. Third, the techniques in this paper yield 
simpler and improved methods for generating fc-way partitions of meshes, earlier shown in [23]. More 
generally, we cross-pollinate several fields, including I/O-efficient computing, VLSI layout, and scientific 
computing. 

2 Geometric Separators and Decomposition Trees 

In this section we review the geometric-separator theorem [27], which we use for partitioning constant- 
dimensional meshes. We then review decomposition trees [24]. Finally, we show how to use geometric 
separators to build decomposition trees for well shaped meshes. 

'For input size N and event E, we say that E occurs with high probability if for any constant c > there exists a proper choice 
of constants defining the event such that Pr{£} > 1 —N~ c . 
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Geometric Separators 

A finite-element mesh is a decomposition of a geometric domain into a collection of interior-disjoint sim- 
plices (e.g., triangles in 2D and tetrahedra in 3D), so that two simplices can only intersect at a lower dimen- 
sional simplex. Each simplicial element of the mesh must be well shaped. Well shaped means that there is 
a constant upper bound to the aspect ratio, that is, the ratio of the radius of the smallest ball containing the 
element to the radius of the largest ball contained in the element [33]. 

A partition of a graph G = (V,E) is a division of G into disjoint subgraphs Go = (Vo,Eq) an d G\ = 
{V\ , E\ ) such that Vq f\V\ =0, and Vq U V\ = V. Go and G\ is a ^-partition of G if they are a partition of G 
and | Vb| , | Vi | < P|V|. We let E{Gq,G\) denote the set of edges in G crossing from Vq to Vi, and E{v,G\) 
denote the set of edges in G connecting vertex v to the vertices of G\. For a function f,G= (V,E) has 
a family of (/ ^-partitions if for each subset S C V and induced graph Gs = (Vs,E$), graph Gs has a 
p-partition of G So = (V So ,E So ) and G Sl = (V Sl ,E Sl ) such that \E S - E So - E Sl \ <f(\V s \). 

The following separator theorem of Miller, Teng, Thurston, and Vavasis [27] shows that meshes can be 
partitioned efficiently: 

Theorem 1 (Geometric Separators [27]) Let G = (V,E) be a well shaped finite-element mesh in d di- 
mensions (d > \). For constants £ (0 < £ < \) and c(e,d) depending only on £ and d, a (f(N) = 
0(N l ~ l / d ), (d + 1 + e) jid + 2))-partition ofG can be computed in 0(d\G\ +c(e,d)) time with probability 
at least 1/2. 

The separator algorithm from [27] works as follows. First, project the coordinates of the vertices of 
the input graph G onto the surface of a unit sphere in (d + 1) -dimensions. The projection of each point is 
independent of all other input points and takes constant time. Sample a constant number of points from all 
projected points uniformly at random. Compute a centerpoint of the sampled points. (A centerpoint of a 
point set in J-dimensions is a point such that every hyperplane through the centerpoint divides the point 
set approximately evenly, i.e., in the ratio of d to 1 or better.) Rotate and then dilate the sampled points. 
Both the rotation and dilation are functions of the centeipoint and the dimension d. Choose a random great 
circle on this unit sphere. (A great circle of a sphere is a circle on the sphere's surface that evenly splits 
the sphere.) Map the great circle back to a sphere in the J-dimensional space by reversing the dilation, the 
rotation, and the projection. Now use this new sphere to divide the vertices and the edges of the input graph. 

Now more mechanics of the algorithm. Mesh G is stored in an array. Each vertex of G is stored with its 
index (i.e., name), its coordinates, and all of its adjacent edges, including the index and coordinates of all 
neighboring vertices. (This mesh representation means that each edge is stored twice, once for each of the 
edge's two vertices.) 

To run the algorithm, scan the vertices and edges in G after obtaining the sphere separator. During the 
scan, divide the vertices into two sets, Go, containing the vertices inside the new sphere and G\, containing 
the vertices outside the sphere. Mark an edge as "crossing" if the edge crosses from Go to G\. Verify 
that the number of crossing edges, \E{Gq,G\)\, is 0{\G\ l ~ X ' d ), and if not, repeat. The cost of this scan is 
0(\G\ /B+ 1) memory transfers. 

The geometric separator algorithm has the following performance: 

Corollary 2 Let G = (V,E) be a well shaped finite-element mesh in d dimensions (d > I). For constants 
£ (0 < £ < \) and c(e,d) depending only on £ and d, the geometric-separator algorithm finds an (f(N) = 
0{N l ~ l l d ), (d+ 1 + e) / (d + 2)) -partition of G. The algorithm runs in G(|G|) on a RAM and uses 0(1 + 
\G\/B) memory transfers in the DAM and cache-oblivious models, both in expectation and with probability 
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at least 1 jl. With high probability, the geometric-separator algorithm completes in 0(\G\ log |G|) on a RAM 
and uses 0(1 + |G| log \G\/B) memory transfers in the DAM and cache-oblivious models. 

Proof A linear scan of G takes time 0(|G|) and uses an asymptotically optimal number of memory 
transfers. We expect to find a good separator after a constant number of trials, and so the expectation 
bounds follow by linearity of expectation. The probability that after selecting clg|G| candidate separators, 
none are good is at most l/2 clg l G l = |G|~ C . Thus, with high probability, the geometric separator algorithm 
completes in O ( | G | log \G\) on a RAM and uses 0(1 + |G|log \G\/B) memory transfers in the DAM and 
cache-oblivious models. The separator algorithm is cache-oblivious since it is not parameterized by B or M. 



Decomposition Trees 

A decomposition tree Tq of a graph G = (V,E) is a recursive partitioning of G. The root of Tq is G. Root 
G has left and right children Go and G\, and grandchildren Goo, Goi, Gio, G\\, and so on recursively down 
the tree. Graphs Go and G\ partition G, graphs Goo and Goi partition Go, and so on. More generally, a node 
in the decomposition tree is denoted G p (G p C G), where p is a bit string representing the path to that node 
from the root. We call p the id of G p . We say that a decomposition tree is ^-balanced if for all siblings 
Gpo = (V p o,E p o) and G p \ = (V p \,E p \) in the tree, |Vpo| , \V P \ | < P \Vp\- We say that a decomposition tree is 
balanced if P = 1/2. For a function /, graph G has an f decomposition tree if for all (nonleaf) nodes G p 
in the decomposition tree, \E{G p q,G p \)\ < f{\V p \). A ^-balanced / decomposition tree is abbreviated as an 
(/, ^-decomposition tree. 

For a parent node G p and its children G p o and G p \, there are several categories of edges. Inner edges 
connect vertices that are both in G p o or both in G p \. Crossing edges connect vertices in G p o to vertices 
in G p \. Outgoing edges of G p o (resp. G p \) connect vertices in G p o (resp. G p \) to vertices in neither set, 
i.e., to vertices in G — G p . Outer edges of G p o (resp. G p \) connect vertices in G p o (resp. G p \) to vertices 
in G — G p o (resp. G — G p \); thus an outer edge is either a crossing edge or an outgoing edge. More 
formally, inner(G p0 ) = E(G pQ ,G pQ ), crossing(G p ) = E(G p0 ,G p i), outgoing(G p0 ) = E(G pQ ,G - G p ), and 
outer(Gpo) = E(G p0 ,G - G pQ ). 

We build a decomposition tree Tq of mesh G recursively. First we run the geometric separator algorithm 
on the root G to find the left and right children, Go and G\. Then we recursively build the decomposition 
tree rooted at Go and then the decomposition tree rooted at G\. (Thus, the right child of Tq is not processed 
until the whole left subtree is built.) 

The decomposition tree is encoded as follows. Each leaf node G q for Tq stores the single vertex v and 
the bit string q (the root-to-leaf path). The leaf nodes of Tq are stored contiguously in an array Lq. The bit 
string q contains enough information to determine which nodes (subgraphs) of Tq contain v — specifically 
any node Gq, where q is a prefix of q (including q). As mentioned earlier, each vertex is stored along with 
its coordinates, adjacent edges, and coordinates of all neighboring vertices in G. (Recall that each edge is 
therefore stored twice, once for each of the edge's vertices.) Each edge e in G is a crossing edge for exactly 
one node in the decomposition tree Tq. In Tq, each edge e also stores the id p of the tree node G p for which 
e is a crossing edge. The bit strings on nodes and edges therefore contains enough information to determine 
which edges are crossing, inner, and outer for which tree nodes. Specifically, e 6 crossing (G p ). Let p be a 
prefix of p that is strictly shorter (p ^ p); then e € inner(G / 3). Let p be bit string representing a node in Tq 
where p is a strictly shorter prefix of p (p ^ p). Then e € outer(Gp). If pO and pi represent nodes in Tq, 
then e G outgoing (G p o) or e e outgoing(Gpi). 
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Thus, decomposition tree Tq is laid out in memory by storing the leaves in order in an array Lq. We 
do not need to store internal nodes explicitly because the bit strings on nodes and edges encode the tree 
structure. 

Here are a few facts about our layout of Tq. Given any two nodes G p and G q of Lq, the common prefix 
of p and q is the smallest node in Tq containing all vertices in both G p and G q . All the vertices in any node 
G p of Tq are stored in a single contiguous chunk of the array. Thus, we can identify for G p , which edges are 
inner, crossing, outer, and outgoing by performing a single linear scan of size 0{\G P \). 

We construct the decomposition tree Tq by recursively partitioning of G. While Tq is in the process 
of being constructed, its encoding is similar to the above, except that (1) a leaf node G q may contain more 
than one vertex, and (2) some edges may not yet be labelled as crossing. Thus, when the process begins, 
Tq is just a single leaf comprising G. The nodes are stored in a single array of size 0(|G|) and are stored 
in an arbitrary order. Then we run the geometric separator algorithm. Once we find a good separator, we 
partition G into Go and G\, and we store Go before G\ in the same array. We label vertices of Go with 
bit string and vertices of G\ with bit string 1. We then run through and label all crossing edges with the 
appropriate bit string (for the leaf node, the empty string). Now the nodes in each of Go and G\ are stored 
in an arbitrary order, but the subarray containing Go is stored before the subarray containing G\. We then 
apply the geometric separator algorithm for Go- We partition into Goo and Goi, label vertices in Go with 
00 or 01, and label all crossing edges of Go with the bit string 0; we then do the same for Goo and so on 
recursively until all leaf nodes are graphs containing a single vertex. 

We now give the complexity of building the decomposition tree. Our high-probability bounds are based 
on the following observation involving a coin with a constant probability of heads. In order to get at least 
one head with probability at least 1 — 1 /poly (N), ©(log N) flips are necessary and sufficient. In order to get 
&(logN) heads with probability at least 1 — l/poly(Af), the asymptotics do not change; &(logN) flips are 
still necessary and sufficient. The following lemma can be proved by Chernoff bounds (or otherwise): 

Lemma 3 Consider S > c log N flips of a coin with a constant probability of heads, for sufficiently large 
constant c. With probability at least 1 — 1 /poly(A^), &(S) of the flips are heads. 

Theorem 4 Let G = (V,E) be a well shaped finite-element mesh in d dimensions (d > \). Mesh G has a 
(2d + 3)/(2d + 4)-balanced-0(\V\ ll ^ d ) decomposition tree. On a RAM, the decomposition tree can be 
computed in time 0(|G|log |G|) both in expectation and with high probability. The decomposition tree can 
be computed in the DAM and cache-oblivious models using 0(1 + (|G|/S)log(|G|/M)) memory transfers 
in expectation and 0(1 + (|G|/fi)log |G|) memory transfers with high probability. 

Proof We first establish that the tree construction takes time 0(\G\ log |G|) on a RAM in expectation. The 
height of the decomposition tree is 0(log |G|), and the total size of all subgraphs at each height is 0(|G|). 
Since the decomposition of a subgraph takes expected linear time, the time bounds follow by linearity of 
expectation. 

We next establish that the tree construction uses 0(1 + (|G|/B)log(|G|/M)) expected memory transfers 
in the DAM and cache-oblivious models. Because we build the decomposition tree recursively, we give 
a recursive analysis. The base case is when a subtree first has size less than M. For the base case, the 
cost to build the entire subtree is 0(M/B) because this is the cost to read all blocks of the subtree into 
memory. Said differently, once a subgraph is a constant fraction smaller than M, the cost to build the 
decomposition tree from the subgraph is 0, because all necessary memory blocks already reside in memory. 
For the recursive step, recall that when a subgraph G p has size greater than M, the decomposition of a 
subgraph takes expected 0(\G p \/B) memory transfers, because this is the cost of a linear scan. Thus, 
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there are 0(log (\G\/M)) levels of the tree with subgraphs bigger than M, so the algorithms uses expected 
0(1 + (|G|/fl)log(|G|/M)) memory transfers. 

We next establish the high-probability bounds. We show that the building process uses 0(|G|log |G|) 
time on a RAM and 0(1 + |G|log|G|/B) memory transfers in the DAM and the cache-oblivious models 
with high probability. 

First consider all nodes that have size H(|G| /log |G|). There are ©(log |G|) such nodes. To build these 
nodes, we require a total of ©(log |G|) good separators. We can view finding these separators as a coin- 
flipping game, where we need @(log|G|) heads; by Lemma|3]we require ©(log |G|) coin flips. However, 
separators near the top of the tree are more expensive to find than separators deeper in the tree. We bound 
the cost to find all of these separators by the cost to build the root separator. Thus, building these nodes 
uses time 0(|G|log |G|) and 0(1 + |G|log \ G\/B) memory transfers with high probability. This is now the 
dominant term in the cost to build the decomposition tree. 

Further down the tree, where nodes have size 0(|G| / log |G|), the analysis is easier. Divide the nodes to 
be partitioned into groups whose sizes are within a constant factor of each other. Now each group contains 
£2(log |G|) elements. Thus, by Lemma[3]the time to build the rest of the tree with high probability equals 
the time in expectation, which is ©(jGjlog |G|). 

We now finish the bound on the number of memory transfers. As above, because we build the decom- 
position tree recursively, subtrees a constant fraction smaller than M are build for free. Also, because each 
group contains £2(log|G|) elements, the cost to build these lower levels in the tree with high probability 
equals the expected cost, which is 0(1 + (|G| /B)log (|G| /M)). This cost is dominated by the cost to build 
the nodes of size £2(|G| /log |G|). ■ 

3 Fully-Balanced Decomposition Trees for Meshes 

In this section we define fully-balanced partitions and fully-balanced decomposition trees. We give al- 
gorithms for generating these structures on a well shaped mesh G. As we show in Section 01 we use 
a fully-balanced decomposition tree of a mesh G to generate a cache-oblivious mesh layout of G. Our 
construction algorithm is an improvement over [7, 24] in two respects. First the algorithm is faster, requir- 
ing only 0(|G|log 2 |G|) operations in expectation and with high probability, 0(1 + (|G| /S)log 2 (|G| /M)) 
memory transfers in expectation, and 0(1 + (|G|/B)(log 2 (\G\/M) +log |G|)) memory transfers with high 
probability. Second, the result is simplified, no longer relying on a complicated theorem of [18]. 
This section makes it easier to present the main result of the paper, which appears in Section [5] 

Fully-Balanced Partitions 

To begin, we define a fully-balanced partition of a subgraph G p of G. A fully-balanced f -partition of 
G p C G is a partitioning of G p = (V p ,E p ) into two subgraphs G p o = (Vpo,E p o) and G p \ = (V p i,E p i) such 
that 

• (crossing (G p ) | </(|V p |), 

• |\>| = |^i|±0(l),and 

• (outgoing (G p0 ) | = (outgoing (G p i)| ± 0(1). 

We give the following result before presenting our algorithm for computing fully-balanced partitions. 
The existence proof and time complexity comprise the easiest case in [18]. 
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Figure 1 : Unfilled beads represent blue elements and filled beads represent red elements. Pick an arbitrary 
initial bisection A and A of the necklace. Here A contains more than half of all blue beads. (We can focus 
exclusively on blue beads because if A contains half of the blue beads to within one, it also contains half of 
red beads to within one.) We "turn" the bisection clockwise so that A takes one bead from A and relinquishes 
one bead to A. Thus, the number of blue beads in A can increase/decrease by one or remain the same after 
each turn. However, after N /2 turns, A becomes A, which contains less than half of all blue beads. So by a 
continuity argument, A contains half of all blue beads after some number of turns. The argument is similar 
for both odd and even N. 

Lemma 5 Given an array L ofN elements, where each element is marked either blue or red, there exists a 
subarray that contains half of the blue elements to within one and half of red elements to within one. Such 
a subarray can be found in 0(N) time and 0(1 +N/B) memory transfers cache-obliviously. 

Proof This result is frequently described in terms of "necklaces." Conceptually, attach the two ends of 
the array together to make a necklace. By a simple continuity argument (the easiest case of that in [18]), the 
necklace can be split into two pieces, A and A, using two cuts such that both pieces have the same number 
of blue elements to within one and the same number of red elements to within one. (For details of the 
continuity argument, see Figure [TJ) Translating back to the array, at least one of A and A does not contain 
the connecting point and is contiguous. 

To find a good subarray, first scan L to count the number of blue elements and the number of red 
elements. Now rescan L, maintaining a window of size N /2. The window initially contains the first half of 
L and at the end contains the second half of L. (For odd N, the middle element of the array appears in all 
windows.) Stop the scan once the window has the desired number of red and blue elements. 

Since only linear scans are used, the algorithm is cache-oblivious and requires 0(1 +N/B) memory 
transfers. ■ 

We now present an algorithm for computing fully-balanced partitions. Given G p C G, and a (f(N) = 
0(N a ), P) -partitioning geometric separator, FullyBalancedPartition (G p ) computes a fully-balanced (f(N) = 
0(Af a ))-partition G px and G py of G p . 
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FullyBalancedPartition (Gp) 

1. Build a decomposition tree — Build a decomposition tree Tq p of G p using the (f(N) = 0(N a ),$)- 
partitioning geometric separator. 

2. Build a red-blue array — Build an array of blue and red elements based on the decomposition tree 
To . Put a blue element for each leaf G q in 7^,; thus there is a blue element for each vertex v in 
G p . Now insert some red elements after each blue element. Specifically, after the blue element 
representing vertex v, insert E(v,G — G p ) red elements. Thus, the blue elements represent vertices in 
G p = (Vp,E p ) for a total of \V P \ blue elements, while the red elements represent edges to vertices in 
G — G p , for a total of E(G P , G — G p ) red elements. 

3. Find a subarray in the red-blue array — Find a subarray of the red-blue array based on Lemma [5] 
Now partition the vertices in G p based on this subarray. Specifically, put the vertices representing 
blue elements in the subarray in set V px and put the remaining vertices in G p in set V py . 

4. Partition G p — Compute G px and G py from V px and V py . This computation also means scanning 
edges to determine which edges are internal to G px and G py and which have now become external. 

We first establish the running time of FullyBalancedPartition(Gp). 

Lemma 6 Given a graph G p that is a subgraph of a well shaped mesh G, FullyBalancedPartition(Gp) 
runs in 0(\G p \log\G p \) on a RAM, both in expectation and with high probability (i.e., probability at 
least 1 — l/poly(|G p |)J. In the DAM and cache-oblivious models, FullyBalancedPartition(Gp) uses 
0(1 + (|Gp|/5)log (\G p \/M)) memory transfers in expectation and 0(1 + |Gp|log \G p \/B) memory trans- 
fers with high probability. 

Proof According to Theorem 01 Step 1 of FullyBalancedPartition (G p ) (computing Tq ) takes time 
0(\G P \ log \G P \) on a RAM, both in expectation and with high probability. In the DAM and cache- 
oblivious models, this steps requires G(l + (\G P \/B) log (\G P \/M)) memory transfers in expectation and 
0(1 + | G p | log \G P \/B) memory transfers with high probability. Steps 2-4 of FullyBalancedPartition (G p ) 
each require linear scans of an array of size 0(|Gp|), and therefore are dominated by Step 1. ■ 

We next establish the correctness of FullyBalancedPartition(Gp). In the following, let constant b repre- 
sent the maximum degree of mesh G. 

Lemma 7 Given a well shaped mesh G and a subgraph G p C G, FullyBalancedPartition generates a fully- 
balanced partition of G p . 

Proof By the way that we generate V px and V py , we have 

Hvg-iM^ 1 - 

This is because the number of blue elements in the subarray is exactly \V px \, and the number of blue elements 
within and without the subarray differ by at most one. 
We next show that 

| (outgoing (G p> ,) | -|outgoing(Gp t ) 1 1 <2b+l. (1) 

To determine (outgoing (G px ) \ and (outgoing (Gpy) |, modify the subarray as follows. Remove from the sub- 
array any red elements at the beginning of the subarray before the first blue element in the subarray. Then 
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(a) An example subgraph G p of mesh G. Subgraph G p 
has eight vertices, ten edges, and eight outer edges 
(i.e., |outer(G p )| = 8). 



® o 



(1,5,6,7)1 



(1,2,3,4,5,6,7,8) 

(1,2) (4,7) 



1(2,3,4,8) 
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(b) A decomposition tree of the subgraph G p from |(a)| Building this 
decomposition tree is the first step for FullyBalancedPartition(Gp). 
The crossing edges at each node are indicated by lines between 
the two children. Thus, crossing((G ; ,)o) = {(1,5), (6,7)} and 
crossing((Gp)ioi) = {(4,8)}. Observe that each edge in G p is a 
crossing edge for exactly one node in the decomposition tree. 



o © o o © © 
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(c) The red-blue array for G p . The blue elements have a dark shade. The red elements have a light shade. There is 
one blue element for each vertex in G p . There is one red element for each outgoing edge in G p . Since element 1 
is adjacent to two edges in outer(Gp), there are two red elements after it in the red-blue array. The figure indicates 
a subarray containing half of the blue elements and half of the red elements to within one. The red-blue array is 
used to make the fully-balanced partition of G p . Specifically, G px will contain vertices 2, 5, 6, and 7 and G py will 
contain vertices 1, 3, 4, and 8. Partition G px inherits three outer edges from G p , and partition G pv inherits five outer 
edges from G p . This particular subarray means that two paths in the decomposition tree will be cut. One path, 
separating element 1 from 6, goes from node (Gp)no to the root. The other path, separating element 2 from 4, goes 
from node (G p )\o to the root. The edges that are cut by this partition are the crossing edges of these nodes, i.e., 
E(G px ,G py ) = {(1,6), (1,5), (6, 7), (1,2), (4, 7), (2,3), (3, 8), (2,4), (2,8)}. If G p is anode in the fully-balanced 
decomposition tree, then its left child will be G px and its right child will be G py . 



Figure 2: The steps of the algorithm FullyBalancedPartition(G p ) run on a sample graph. 



add to the subarray any red elements before the first blue element after the subarray. The number of red 
elements now in the subarray is |outgoing(G /M -)| and the number of red elements not in the subarray is 
|outgoing(G py )|. This modification can only increase or decrease [outgoing (G pA )| and (outgoing (G /; , y )| each 
by b, establishing O. 

Now, following [7,24], we show that 

E(G px ,Gpy) < c\V p \ a (l + P«)/(l - P«). (2) 

By selecting a subarray of the red-blue array, we effectively make two cuts on the leaves of the decompo- 
sition tree Tq . (The only time when there is apparently a single cut is if the subarray is the first half of the 
array. In this case, the second cut separates the first leaf from the last.) Consider one of these cuts. The array 
is split between two consecutive leaves of Tq . Denote by P the root of the smallest subtree of Tc p containing 
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these two leaves; see Figure |2(c)j We consider the upward path P,P\,P2, ■ . ■ ,G P in the decomposition tree 
Tq from P up to the root G p of Tq . Each node in the decomposition tree on this path is a subgraph of G 
that is being split into two pieces. 

We now count the number of edges that get removed as a result of these splits: 

logp|V| 

| crossing (P)U crossing (Pi )U crossing (i^)U...U crossing (G p ) | < ]T c(|V|/P ! ) 

i=0 

< c|V|7(l-p a ). (3) 

As reflected in (O, each node along the path has a different depth, which gives a geometric series. 

The number of edges that cross from G px to G py , E(G px ,G py ), is the number of edges that get removed 
when both cuts get made. However, doubling ([3]) overestimates E(G px ,G py ) by an amount |crossing(G p )| 
since the root G p can only be cut once. Thus, doubling and subtracting | crossing (G p )\, we establish (0. 



Fully-Balanced Decomposition Trees 

A fully-balanced decomposition tree of a graph G is a decomposition tree of G where the partition of every 
node (subgraph) in the tree is fully-balanced. 

We build a fully-balanced decomposition tree BTq of G recursively. First we apply the algorithm Fully- 
BalancedPartition on the root G to find the left and right children, Go and G\. We next recursively build the 
fully balanced decomposition tree rooted at Go and the fully-balanced decomposition tree rooted at G\. 

Theorem 8 (Fully-Balanced Decomposition Tree for a Mesh) A fully-balanced decomposition tree of a 
mesh G of constant dimension can be computed in time 0{\G\ log 2 |G|) on a RAM both in expecta- 
tion and with high probability. The fully-balanced decomposition tree can be computed in the DAM 
and cache-oblivious models using G(l + (|G|/S)log 2 (|G|/M)) memory transfers in expectation and 
0(1 + (|G|/B)(log 2 (\G\/M) -Hog |G|)) memory transfers with high probability. 

Proof We first establish that the construction algorithm takes expected time G(|G| log 2 |G|) on a RAM. By 
Lemma [6l for any node G p in the decomposition tree, we need 0(\G p \\og \ G P \) operations to build the left 
and right children, G p q and G p \, both in expectation and with probability at least 1 — 1 /poly(|G p |). Since the 
left and right children, |G p o| and the \G P \\, of every node G p differ in size by at most 1, BTq has ©(log |G|) 
levels. If \G P \ denotes the size of a node at level i, then level i has construction time G(|G| log \G P \). Thus, 
the construction-time bound follows by linearity of expectation. 

We next establish that the construction algorithm uses <9(l + |G|log 2 (|G|/M)/B) expected memory 
transfers in the DAM and cache-oblivious models. Because we build the decomposition tree recursively, 
we give a recursive analysis. The base case is when a node G p has size less than M while its parent node 
is greater than M. Then the cost to build the entire subtree Tq is only 0(M/B), because this is the cost to 
read all blocks of G p into memory. Said differently, once a node is a constant fraction smaller than M, the 
cost to build the fully-balanced decomposition tree is because all necessary memory blocks already reside 
in memory. There are therefore ©(log |G| — logM) levels of the fully-balanced decomposition tree having 
nonzero construction cost. Each level uses at most 0{{\G\ /B)\og{\G\ /M)) memory transfers. Thus, the 
time bounds follows by linearity of expectation. 
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We next establish the high-probability bounds. In the following analysis, we examine, for each node G p 
in the fully-balanced decomposition tree, the decomposition tree Tq that is used to build that node. We then 
group the nodes of all the decomposition trees by size and count the number of nodes in each group. 

As an example, suppose that |G| is a power of two and all splits are even. There is one node of size |G| 
— the root node of the decomposition tree Tq. There are four nodes of size \G\/2 — two nodes in Tq, one 
node in Tq , and one node in Tq { - There are 12 node of size \G\/4 — four nodes in Tq, two nodes in Tq , 
two node in Tq ] , and one node in each of Tc m , Tq 01 , Tq 10 , and Tq u . 

In general, let group i contain all decomposition tree nodes having size in the range (\G\/2 l , |G|/2' -1 ]. 
Then group i contains 0(/2') nodes. 

Analyzing each group separately, we show that the construction algorithm takes time 0(\G\ log 2 |G|) on 
a RAM with high probability. First, consider the ®(log \G\) largest nodes (those most expensive to build), 
i.e., those in the smallest cardinality groups. As analyzed in Theorem 01 building these nodes takes time 
0(\G\ log \G\) with high probability. 

We analyze the rest of the node constructions group by group. Since each group i contains ®(i2'~ l ) 
nodes, each successive group contains more nodes than the total number of nodes in all smaller groups. As 
a result, there are £2(log |G|) nodes in each of the rest of the groups. Thus, by Lemma[3l the time to build the 
rest of the tree with high probability is the same as the time in expectation, which is 0(|G|log 2 \G\). Thus, 
we establish high-probability bounds on the running time. 

We now show that the construction algorithm takes 0(1 + (|G|/B)(log 2 (|G|/M) +log |G|)) memory 
transfers with high probability. First consider the ©(log |G|) largest nodes (those most expensive to build). 
As analyzed in Theorem 01 building these nodes uses 0(1 + |G|log|G|/B) memory transfers with high 
probability. Now examine all remaining nodes. We consider each level separately. Each group contains 
£2 (log |G|) nodes. Thus, by Lemma |3l the high-probability cost of building the decomposition trees for all 
remaining nodes matches the expected cost, which is 0(1 + (|G| /fi)log 2 (|G| /M)) memory transfers. Thus, 
with high probability, the construction algorithm takes 0(1 + (|G| /B)(log |G| + log 2 (|G| /M))) memory 
transfers with high probability, as promised. ■ 

fc-Way Partitions 

We observe one additional benefit of Theorem [8] In addition to providing a simpler and faster algorithm for 
constructing fully-balanced decomposition trees, we also provide a new algorithm for &-way partitioning, 
as described in [23]. For any positive integer k > 1, a k-way partition of a graph G = (V,E), is a &-tuple 
(Vi ,V 2 , ■ ■ ■ , V k ) (hence (G U G 2 , . . . ,G k )) such that VJi<i< k Vi = V and V { HVj = for i^ j, 1 < i, j < k. For any 
P > 1, {Vi,V 2 ,...,V k ) is a (p\A:)-way partition if |G,| < P[|G|/Jk], for all i £ {1, ...,&}. It has been shown 
in [23] that every well shaped mesh in d dimensions has a (1 +e,fc)-way partition, for any £ > 0, such that 
max,{outer(G;)} = 0((|G| /k) l ~ l l d ). 

We now describe our &-way partition algorithm of a well shaped mesh G. The objective is to evenly 
divide leaves of a fully-balanced decomposition tree of G into k parts such that their number of vertices are 
the same within one. First build a fully-balanced decomposition tree. Now assign the first \V\/k leaves to 
V\, the next \V\/k leaves to V 2 , and so on. 

In fact, we can modify this approach so that it runs faster by observing that we need not build the 
complete fully-balanced decomposition tree. First build the top ©(log/:) levels of the tree, so that there 
are poly(&) leaves. At most k of these leaves need to be refined further, since the remaining leaves will all 
belong to a single group V,-. 
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Our &-way partition algorithm using fully -balanced decomposition trees is incomparable to the algorithm 
of [23]. By building fully-balanced decomposition tree, even a partial one, our algorithm is slower than the 
algorithm of [23], which uses geometric separators for partitioning instead. On the other hand, it can be used 
to divide the nodes into k sets whose sizes are equal to within an additive one, instead of only asymptotically 
the same size as in [23]. 

4 Cache-Oblivious Layouts 

In this section we show how to find a cache-oblivious layout of a mesh G. Given such a layout, we show 
that a mesh update runs asymptotically optimally in 0(1 + \G \ /B) memory transfers given the tall cache 
assumption that M = Q.{B d ). We also analyze the performance of a mesh update when M = o(B d ), bounding 
the performance degradation for smaller M. 
The layout algorithm is as follows. 

CacheObliviousMeshLayout(G) 

1. Build a f(N) = 0{N l ~ l l d ) fully-balanced decomposition tree Tq of G, as described in Theorem[8] 

2. Reorder the vertices in G according to the order of the leaves in Tq. (Recall that each leaf in Tq 
stores a single vertex in G.) This reorder means: (a) assign new indices to all vertices in the mesh, 
and (b) for each vertex, let all neighbor vertices know the new index. 

We now describe the mechanics of relabeling and reordering. Each vertex knows its ordering and 
location in the input layout; this is the vertex's index. A vertex also knows the index of each of its 
neighboring vertices. When we change a vertex's index, we apprise all neighbor vertices of the change. 
These operations entail a small number of scans and cache-oblivious sorts [9, 11, 17, 30], for a total cost of 
0((|G| /B)log M / e (|G| /B) memory transfers. This cost is dominated by the cost to build the fully-balanced 
decomposition tree. (Thus, a standard merge sort, which does not minimize the number of memory transfers, 
could also be used.) 

The cleanest way to explain is through an example. Suppose that we have input graph G = 
{{a,b,c,d} , {(a,c), (a,d),(b,c), (c,d)}}, which is laid out in input order: 

(a,c), (a,d),(b,c), (c,a), (c,b), (c, d),(d, a), (d,c) . 

Suppose that the leaves of fully-balanced decomposition tree are in the order of a,c,d,b. This means that 
the renaming of nodes is as follows: [a: 1], [c : 2], [d : 3], [b : 4]. (For clarity, we change indices from letters 
to numbers.) We obtain the reverse mapping [a: 1], [b : 4], [c : 2], [d : 3] by sorting cache-obliviously. We 
change the labels on the first component of the edges by array scans: 

(a = l,c), (a = l,d), (b = 4,c), (c = 2, a), (c = 2,b), (c = 2,d), [d = 3, a), {d = 3,c) . 
We then sort the edges by the second component, 

(c = 2, a), (d = 3, a), (c = 2,b), (a = l,c), (b = 4,c), [d = 3,c), [a = l,d), (c = 2,d) , 

and change the labels on the second component of the edge by another scan: 

(c = 2,a= l),(d = 3,a= 1), (c = 2,b = 4), (a = l,c = 2), (b = 4,c = 2), (d = 3,c = 2), 
(a = l,d = 3),(c = 2,d = 3). 
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We get 

(2.1) , (3,1), (2,4), (1,2), (4,2), (3,2), (1,3), (2,3). 

We sort these edges by the first component to get the final layout. The final layout is 

(1.2) , (1,3), (2,1), (2,3), (2,4), (3,1), (3,2), (4,2). 
Thus, we obtain the following layout performance: 

Theorem 9 A cache-oblivious layout of a well shaped mesh G can be computed in 0(\G\ log 2 |G|) 
time both in expectation and with high probability. The cache-oblivious layout algorithm uses 0(1 + 
|G|log 2 (|G| /M)/B) memory transfers in expectation and 0(1 + (|G| /5)(log 2 (|G| /M) + log|G|)) memory 
transfers with high probability. 

With such a layout, we can perform a mesh update cache-obliviously. 

Theorem 10 Every well shaped mesh G in d dimensions has a layout that allows the mesh to be updated 
cache-obliviously with 0(1 + |G| /B) memory transfers on a system with block size B and cache size M = 
Q.(B d ). 

Proof We apply the algorithm described above on G to build the layout. Since each vertex of G has 
constant degree bound b, its size is bounded by a constant. Consider a row of nodes G Pl , G P2 , G P3 ... in 
at a level such that each node G p . uses &(M) < M space and therefore fits in a constant fraction of memory. 

In the mesh update, the nodes of G are updated in the order of the layout, which means that first the 
vertices in G Pl are updated, then vertices of G P2 , then vertices of G P3 , etc. To update vertices of G Pi , the 
vertices must be brought into memory, which uses at most 0(1 +M/B) memory transfers. In the mesh 
update, when we update a vertex u, we access u's neighbors. If the neighbor v of u is also in G Pi , i.e., edge 
(w,v) is internal to G Pi , then accessing this neighbor uses no extra memory transfers. On the other hand, if 
the neighbor v is not in G Pi , then following this edge requires another transfer hence an extra block to be 
read into memory. 

We now show that outer(G Pi ) = 0(|G Pi | 1-1 ^). Since all subgraphs at the same level of the fully- 
balanced decomposition tree are of the same size within one, and outgoing edges of any subgraph are evenly 
split, each G Pi has roughly the same number of outer edges. Suppose G Pi is in level j. The total number of 
their outer edges are at most 

Hence, outer(G Pi ) = 0((|G| /2 j ) l ~ l,d ) = 0(\G P , \ l ~ l/d ) = 0{M l ~ 1 l d ). Therefore the total size of mem- 
ory that we need to perform a mesh update of the vertices in G Pi is ®(M + BM x ~ x l d ). 

By the tall-cache assumption that B d < M, i.e., B < M x / d , and for a proper choice of constants, the mesh 
update for G Pi only uses &(M) < M memory. Since updating each node G Pi of size &(M) uses 0(1 +M/B) 
memory transfers, and there are a total of 0(|G| /M) such nodes, the update cost is 0(1 + |G| /B), which 
matches the scan bound of G, and is optimal. ■ 

Thus, for dimension d = 2, we have the "standard" tall-cache assumption [17], and for higher dimensions 
we have a more restrictive tall-cache assumption. We now analyze the tradeoff between cache height and 
complexity. Suppose instead of a cache with M = Q.(B d ), the cache is of M = Q.(B d ~ e ). We assume 
e < d — 1 . We show that the cache performance of mesh update is B E / d away from optimal. 
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Corollary 11 Every well shaped mesh G in d dimensions has a vertex ordering that allows the mesh to 
be updated cache-obliviously with 0(1 + |G| /B l ~ £ ^ d ) memory transfers on a system with block size B and 
cache size M = Q.(B d ~ E ). 

Proof We apply similar analysis to that in Theorem [10] on G. From Theorem[lOl the total size of memory 
that we need to update mesh G Pi is &(M + BM l ~ l l d ). Since M = £l(B d ~ £ ), we have 

0(M + BM^ l/d ) = 0(M + M 



= 0{M + MB & l d ). 

Thus, updating G Pi uses 0(1 + (M + MB E l d )/B) memory transfers, which simplifies to 0(1 + \G\ /B^ e/d ) 
memory transfers. ■ 



5 Relax-Balanced Decomposition Trees and Faster Cache-Oblivious Lay- 
outs 

In this section we give the main result of this paper, a faster algorithm for finding a cache-oblivious mesh 
layout of a well-shaped mesh. The main idea of the algorithm is to construct a new type of decomposition 
tree, which we call a relax-balanced decomposition tree. The relax-balanced decomposition tree is based on 
what we call & relax-balanced partition. We give an algorithm for building an relax-balanced decomposition 
tree whose performance is nearly a logarithmic factor faster than the algorithm for building a fully-balanced 
decomposition tree. We prove that an asymptotically optimal cache-oblivious mesh layout can be found by 
traversing the leaves of the relax-balanced decomposition tree. 

Relax-Balanced Partitions 

We first define the relax-balanced partition of a subgraph G p of G. A relax-balanced f -partition of G p C G 
is a partitioning of G p into two subgraphs G p q and G p \ such that 

• grossing (Gp) | < f(\G p \), 

• \G p0 \ = \G p i\± 0{\G P \ /log 3 |G|), and 

• |outgoing(G p0 )| = |outgoing(G p i)| ± <9( (outgoing (G p i)| /log 2 |G|). 

We next present an algorithm, RelaxBalanced Partition, for computing balanced partitions. Given 
G P C G, and a (f(N) = 0(N a ), p) -partitioning geometric separator, RelaxBalancedPartition(G p ) computes 
a relax-balanced (f(N) = 0(/V a ))-partition G px and G py of G p . 

We find the relax-balanced partition by building what we call a relax partition tree Tq ■ We call the top 
31og^plog |G| levels of Tq p the upper tree of Tq p and the remaining levels the lower tree of Tq p . 

We build the upper tree by building the top Slogj/plog |G| levels of a decomposition tree of G p . By 
construction, all leaves of the upper tree (subgraphs of G p ) contain at most \G P \ / log 3 |G| vertices. Outer 
edges of G„ are distributed among these leaves. By a counting argument, there are at most log 2 |G| leaves 
that can contain more than |outer(G / ,)| /log 2 [G| outer edges of G p . 
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For each upper-tree leaf having more than |outer(G p ) | / log 2 |G| outer edges, we refine the leaf by build- 
ing a decomposition tree on it. We do not refine the other leaves of the upper tree. The union of these 
decomposition trees comprise the lower tree. 

Relax partition tree Tg has leaves at different depths. Some leaves are subgraphs having a single vertex 
while others may have up to |G|/log 3 |G| vertices. The tree is stored in the same format as a standard 
decomposition tree. Thus, leaves of the relax partition tree that are not refined contain vertices stored in an 
arbitrary order. The relax partition tree Tq of G p is just a decomposition tree if there are fewer than log 3 \G\ 
vertices. 

RelaxBalancedPartition (G p ) 

1. Build Tg — Build the relax partition tree Tg from G p recursively. 

2. Build red-blue array — Build an array of vertices by an in-order traversal of leaves of Tg . Vertices 
in leaves that are not refined are laid out arbitrarily. Build a red-blue array and find a subarray in the 
red-blue array as described in FullyBalancedPartition. 

3. Modify red-blue array - Modify the subarray to satisfy the following constraint. All vertices in an 
(unrefined) leaf must stay together, either within or without the subarray. If any cut separates the 
vertices, then move the cut leftward or rightward to be in between the leaf node and a neighbor. 
Now partition the vertices in G p based on this modified subarray. Put the vertices representing blue 
elements that are in the subarray into set V px and put the vertices representing blue elements that are 
outside of the subarray into set V py . 

4. Partition G p — Compute G px and G py from V px and V py . This computation also means scanning 
edges to determine which edges are internal to G px and G py and which have now become external. 

We first establish the running time of RelaxBalanced Partition (G p ). 

Lemma 12 Given a subgraph G p of a well shaped mesh G, \G P \ > log 3 \G\, RelaxBalancedPartition(G p ) 
runs in time 0(|G p |loglog \G\) on a RAM and 0(1 + (\G P \ /B)min{loglog |G|,log(|G p | /M)}) memory 
transfers in the DAM and cache-oblivious models in expectation. With high probability, it runs in 
0(\G p \\og \G p \) on a RAM and 0(1 + \G p \ log \G p \/B) memory transfers in the DAM and cache-oblivious 
models. 

Proof We establish that the construction algorithm runs in expected time 0(\G P \ log log |G|) on a RAM. 
The upper tree of Tg takes expected time 0(|G P | log log |G|). There are at most log 2 |G| leaves of the upper 
tree to be refined. For each of these leaves, we build a decomposition tree, and this takes expected time 

0((\G P \ /log 3 |G|)log(|G p | /log 3 |G|)) < 0(\G P \ /log 2 |G|). 

Thus, the total expected time to refine all leaves is G(|G|). Steps 2-4 takes linear time. Thus, 
RelaxBalancedPartition(Gp) finds a relax-balanced partition in expected time 0(|G p |loglog |G|). 

We next establish that the construction algorithm uses G(l + (|G p |/fi)min{loglog |G|,log(|G p | /M)}) 
expected memory transfers in the DAM and cache-oblivious models. There are two cases. The first case is 
whenM> \G P \ /log 3 |G|. Then some of nodes in the top 31ogj/plog |G| levels of the Tq p may be a constant 
fraction smaller than M. Such small nodes require no memory transfers to build, because they are already 
stored in memory. Only the top 0(log(|G p | /M)) levels use memory transfers. The rest of the decomposi- 
tions are free of memory transfers because all necessary memory blocks already reside in memory. When a 
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(a) An example subgraph G p of mesh G. Subgraph G p 
has eight vertices, ten edges, and eight outer edges 
(i.e., |outer(G p )| = 8). 



(1,2) (4,7) 



1(2,3,4,8] 



, , d.5)(6,7) , 

Tfi l ( SJ) 



0^ 



(b) A relax partition tree of the subgraph G p from |(a)| Building this 
decomposition tree is the first step for RelaxBalancedPartition(G p ). 
Observe that each edge in G p is a crossing edge for at most one node 
in the decomposition tree. Some edges, such as (2,4), are not cross- 
ing edges for any node. The top three levels of the decomposition 
tree are the upper tree. We refine a leaf of the upper tree if only it has 
many (at least three) edges from outer(Gp). Upper tree leaf (G p )oo 
has 4 edges from outer(Gp). Upper tree leaf (G p )o\ has 1 edge from 
outer(Gp). Upper tree leaf (G p )io has 1 edge from outer(Gy,). Upper 
tree leaf (G p )n has 2 edges from outer(Gp). Thus, only (G p )oo is 
further refined. 
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(c) The red-blue array for G p . The blue elements have a dark shade. The red elements have a light shade. There is one 
blue element for each vertex in G p . There is one red element for each outgoing edge in G p . The figure indicates 
a subarray containing half of the blue elements and half of the red elements to within one. However, this subarray 
separates element 8 from element 2. This cut is not allowed because 8 and 2 are in the same leaf of the relax 
partition tree. Instead the cut is moved left to the first valid position. The new cut separates element 5 from element 
8, which is allowed because 5 and 8 are in different leaves of the relax partition tree. Thus, G px will contain vertices 
5, 6, and 7, and G py will contain vertices 1, 2, 3, 4, and 8. 



Figure 3: The steps of the algorithm RelaxBalanced Partition^,) run on a sample graph. 



subgraph G p has size £1(M), then the partition of a subgraph takes expected 0(|G P | /B) memory transfers, 
because this is the cost of a linear scan. Hence, the total cost is 0(1 + {\G P \ /B)\og{\G p \ /M)). 

The second case is whenM < \G P \ / log 3 \G\. Then, the upper tree of To takes 0(1 + \G P \ log log \G\/B) 
memory transfers in expectation. There are at most log 2 \G\ leaves of the upper tree of Tq that need further 
refinement, and the leaf sizes are at most \G P \ / log 3 \ G\. Building a decomposition tree on one of these 
leaves takes 

0(1 + {\G P \ /log 3 |G|)log(|G p | /log 3 \G\)/B) < 0(1 + \G P \ /Slog 2 |G|) 

memory transfers in expectation. Since there are at most log 2 \G\ leaves, the total expected number of 
memory transfers to construct the lower tree of Tq > is 0(\G P \ /B), which is dominated by the cost to build 
the upper tree. 

Combining the two cases, we obtain that the expected number of memory transfers to build Tq is 
0(1 + (\G P \ /5)min{loglogG,log(|G p | /M)}). 
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We next establish the high-probability bounds. We first consider all nodes that have size 
n(|G p | /log \G P \). There are 0(log|G p |) such nodes. Building these nodes uses time 0(|G p |log \G P \) 
and 0(1 + \G P \ log \G P \/B) memory transfers with high probability by Theorem[4] 

For the rest of the upper tree of Tq , each level contains n(log|G p |) nodes. Thus, the number of 
memory transfers with high probability matches the number of memory transfers in expectation, which is 
0(1 + (\G P \ /B)min{loglogG,log(jGp| /M)}). The cost to build the rest of the upper tree is dominated by 
the cost to build the largest G(log \ G P \) nodes in the upper tree. 

As described above, the expected cost to build the lower tree is 0(\G P \) time and 0{\G p \/B) memory 
transfers. The high-probability bounds are at most a G(log|G p |) factor greater and hence are dominated 
by the cost to build the upper tree. Thus, we establish the high probability bounds on time and memory 
transfers. ■ 

We next establish the correctness of RelaxBalancedPartition(G p ). In the following, let b represent the 
maximum degree of mesh G. 

Lemma 13 Given a well shaped mesh G and a subgraph G p C G, RelaxBalancedPartition(Gp) generates a 
relax-balanced partition ofG p . 

Proof By the way we construct the relax partition tree Tq , nodes that are not refined contain 
0( (outer (G p )| / log 2 |G|) outer edges of G p , and their sizes differ by 0(|G P | / log 3 |G|). Thus, by the way we 
generate G px and G py , the number of outgoing edges of G px and G py differ by 0(|outer(G p )| /log 2 |G|) and 
\G px \ and \G py \ differ by 0(\G P \ /log 3 |G[). Recall that outgoing (G px ) U outgoing (G py ) = outer(G /) ). Thus, 
we have 

(outgoing (G px ) \ = (outgoing (Gpy) | ± 0( (outgoing (G py )\ / log 2 |G|). 
As shown in Equation (O from Lemma |7J the number of crossing edges satisfies (crossing (G p )\ < 

f(\G P \).m 

Relax-Balanced Decomposition Trees 

A relax-balanced decomposition tree of a well shaped mesh G is a decomposition tree of G where every 
partition of every node G p in the tree is relax-balanced. 

We construct a relax-balanced decomposition tree of G recursively. First we apply the algorithm Re- 
laxBalancedPartition on the root G to get the left and right children, Go and G\. We next recursively build 
the (left) subtree rooted at Go and then the (right) subtree rooted at G\. 

Theorem 14 (Relax-Balanced Decomposition Tree for a Mesh) A relax-balanced decomposition tree of 
a well shaped mesh G of constant dimension can be computed in time G(|G|log |G|loglog |G|) on a RAM 
both in expectation and with high probability. The relax-balanced decomposition tree can be computed in the 
DAM and cache-oblivious models using 0(1 + (|G|/B)log (|G|/M) minjloglog |G|,log((G| /M)\) memory 
transfers in expectation and 0(1 + (|G|/S)(log (|G|/M)min{loglog |G|,log(|G| /M)} + log |G|)) memory 
transfers with high probability. 

Proof When |G| < M, the construction algorithm takes 0(|G|) time and 0(\G\/B) memory transfers, 
both in expectation and with high probability. We consider 0(|G|) = Q.(M) in the following analysis. 
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We first analyze the expected running time of the algorithm on a RAM. The construction time of each 
node G p is 0(|G P | log log |G|), and there are 0(log|G|) levels in the relax-balanced decomposition tree. 
Thus, by linearity of expectation, the expected running time is 0(\G\ log \G\ loglog |G|). 

We show that the construction algorithm uses 0(1 + (|G| /fi)log(|G| /M) minjlog log |G|,log(|G| /Af)}) 
memory transfers in the DAM and cache-oblivious models. We analyze large and small nodes in 
the relax-balanced decomposition tree differently. There are two cases. The first case is when a 
tree node G p is large, i.e., \G P \ >log 3 |G|. In this case, RelaxBalancedPartition(G / ,) uses expected 
0(1 + (\G P \ /S)min{loglog |G|,log(|G p | /M)}) memory transfers by Lemma [121 Since all nodes a con- 
stant factor smaller than M can be constructed with no memory transfers, we only need consider nodes of 
size n(M). There are 0(log(|G| /M)) levels of nodes of size Q.{M). So the construction of all nodes larger 
than log 3 | G | takes O (1 + ( | G| /B) log ( | G| /M) min{log log | G| , log ( | G| /M) }) expected memory transfers. 

The second case is when \G P \ < log 3 |G|. In this case, we build a complete decomposition tree at 
each node. Therefore by Lemma[6l the cost to build one of these nodes is 0(1 + (\G p \/B)\og(\G p \/M)) 
in expectation. As before, nodes a constant factor smaller than M can be constructed with no mem- 
ory transfers. Therefore, the number of levels containing nodes of size between Q.(M) and less than 
log 3 |Gj is at most 0(log (log 3 \G\/M)). Thus, the construction of all nodes of size 0(log 3 |G|) uses 
0(1 + (|G| /B)log 2 (log 3 \G\/M)) memory transfers in expectation, which is dominated by the first case. 

Now we establish the high probability bounds. We analyze the largest @(log|G|) nodes and the 
remaining nodes of the relax-balanced decomposition tree separately. Any level of the relax-balanced 
decomposition tree below the largest 0(log|G|) nodes has £2(log|G|) nodes. Hence, for each level, 
the construction cost with high probability matches the construction cost in expectation, which is 
0(|G|loglog |G|) expected time and 0(1 + (|G| /B)min{loglog |G|,log(|G| /M)}) expected memory trans- 
fers. Since the construction algorithm is recursive, a relax-balanced partition of nodes a constant frac- 
tion smaller than M uses no memory transfers. Hence, all levels of the relax-balanced decomposition tree 
other than the largest ©(log |G|) nodes can be constructed in 0(|G|log |G|loglog |G|) time in a RAM and 
0(1 + (|G| /5)log(|G| /M)min{loglog |G|,log(|G| /M)}) memory transfers with high probability. 

For the largest ©(log |G|) nodes of the relax-balanced decomposition tree, we establish the high proba- 
bility bounds using a different approach. Similar to the proof of Theorem[8l we examine each relax partition 
tree that is used to build each node of the relax-balanced decomposition tree, and we examine all nodes 
within all of these relax partition trees. However, now there are upper trees and lower trees; we examine the 
nodes within upper and lower trees separately. 

We look at the upper trees of the relax partition trees of the largest ©(log |G|) nodes of the relax-balanced 
decomposition tree. There are ©(log |Gj) upper trees, which are complete binary trees. Following a similar 
analysis to that in the proof of Theorem [U the construction of the largest ©(log |G|) nodes from among the 
©(log |G|) upper trees takes 0(|G|log |G|) time and uses 0(1 + |G|log \ G\/B) memory transfers with high 
probability. 

For the rest of the nodes in the upper trees, the high probability bounds match the expectation bounds, 
both in time and memory transfers by Theorem [8] Therefore building the nodes in the rest of the upper trees 
takes 0(|G|log 2 log |G|) time and uses 0(|G| log 2 log \G\/B) memory transfers with high probability. This 
cost is dominated by the construction cost of the largest ©(log |G|) nodes of the upper trees. 

We now focus on the lower trees of the relax partition trees of the largest ©(log |G|) nodes of the relax- 
balanced decomposition tree. We show that the cost to build all of the lower trees takes time 0(|G|log |G|) 
and uses 0(1 + |G|log|G|/B) memory transfers with high probability (i.e., probability 1 — l/poly(|G|)). 
With high probability, the lower tree of a partition tree Tq of a subgraph G p can be computed in 0(|G P |) on 
a RAM and with 0(|G P | /B) memory transfers in the DAM and the cache-oblivious models. Given a node 
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Gp and its relax partition tree Tg , there are two cases. The first case is when there are £2 (log |G|) leaves 
of the upper tree of Tq that need to be refined. Thus, with high probability, the construction cost of the 
lower tree of Tq matches the expected construction cost, which is in 0(\G P \) time and 0(|G P | /B) memory 
transfers, as analyzed in Lemma [L2l 

The second case is when there are 0(log |G|) leaves of the upper tree of Tq that need to be refined. 
The construction cost of a single leaf is 0(\G P \ / log 2 |G|) time and 0(\G P \ /Slog 2 |G|) memory transfers in 
expectation. Thus, the construction cost to refine a single leaf with high probability is 0(\G P \ /log |G|) time 
and 0{\G P \ /Slog |G|) memory transfers and the construction cost to refine all leaves with high probability is 
0(\G P \) time and 0{\G P \ /B) memory transfers. Thus, all lower trees of the relax partition trees of the largest 
©(log |G|) nodes of the relax-balanced decomposition tree can be constructed in 0(|G|) time and 0(\G\/B) 
memory transfers with high probability, which is dominated by the construction of all upper trees. 

Hence, with high probability, the construction algorithm runs in 0(\G\ log |G| log log |G|) time on a 
RAM and uses 0(1 + (|G| /B)(log(|G| /M)min{loglog |G|,log(|G| /M)} + log |G|)) memory transfers in 
the DAM and the cache-oblivious models. ■ 

We now show that a relax-balanced decomposition tree can serve the same purpose as a fully-balanced 
decomposition tree in giving cache-oblivious layout. The crucial property is the following. 

Lemma 15 Given a relax-balanced decomposition tree of graph G, all nodes on any level of the relax- 
balanced decomposition tree contain the same number of vertices to within an o (1) factor and all outgoing 
degrees are the same size to within an o(l) factor. 

Proof From the definition of relax-balanced, for any subgraph G p and its two children G Po and G Pl 
(outgoing (G pQ ) | = (outgoing (G pl )| ± 0(|outgoing(G pl )| /log 2 |G|), and |G p0 | = |G pl | ± 0(\G p \ /log 3 |G|). 
Thus, for constant c, the ratio of the outgoing degree or the size between any two subgraphs at depth i is at 
most (1 +c/log 2 |G|)' and (1 +c/log 3 |G|) ! . Since there are 0(log |G|) levels, these quantities differ by at 
most an o(l) factor, as promised. ■ 

Similar to Section|4j to find a cache-oblivious layout of a well shaped mesh G, we build a relax-balanced 
decomposition tree of G. The in-order traversal of the leaves gives the cache-oblivious layout. Lemma [T5l 
guarantees that we can apply the same analysis from Section @] to show that we have a cache-oblivious 
layout. 

We thus obtain the following result: 

Theorem 16 A cache-oblivious layout of a well shaped mesh G can be computed in time 
0(\G\ log |G| loglog |G|) on a RAM both in expectation and with high probability. The 
cache-oblivious layout can be computed in the DAM and cache-oblivious models using 
0(1 + (|G|/B)log(|G|/M)min{loglog|G|,log(|G|/M)}) memory transfers in expectation and 
0(l + (|G|/fi)(log(|G|/M)min{loglog|G|,log(|G|/M)} + logG)) memory transfers with high prob- 
ability. 

6 Applications and Related Work 
Applications of Mesh Update 

The mesh update problem appears in many scientific computations and ranks among most basic primitives 
for numerical computations. In finite-element and finite-difference methods, one must solve very large- 
scale sparse linear systems whose underlying matrix structures are meshes [27]. In practice, these linear 
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systems are solved by conjugate gradient or preconditioned conjugate gradient methods [15, 31]. The most 
computational intensive operation of conjugate gradient is a matrix-vector multiplication operation [6, 15, 
36, 37] which amounts to a mesh update in finite-element applications. The iterative conjugate gradient 
method repeatedly performs mesh updates. Mesh update is also the key operation in fast multipole methods 
(FMM) for N-body simulation [19,20], especially when particles are not uniformly distributed [32]. The 
partitioning and layout techniques presented here also apply to the adaptive quadtrees or octtrees used in 
non-uniform N-body simulation. 

Related Work 

The cache-oblivious memory model was introduced in [17,30], and cache-oblivious algorithms have been 
developed for many scientific problems such as matrix multiplication, FFT, and LU decomposition [8, 17, 
30,35], Now the area of cache-oblivious data structures and algorithms is a lively field. 

There are other approaches to achieving good locality in scientific computations. One alternative to the 
cache-oblivious approach is to write self-tuning programs, which measure the memory system and adjust 
their behavior accordingly. Examples include scientific applications such as FFTW [16], ATLAS [39], and 
self-tuning databases (e.g., [38]). The self-tuning approach can be complementary to the cache-oblivious 
approach. For example, some versions of FFTW [16] begin optimization starting from a cache-oblivious 
algorithm. 

Methods exploiting locality for both sequential (out-of-core) and parallel implementation of iterative 
methods for sparse linear systems have long history in scientific computing. Various partitioning algorithms 
have been developed for load balancing and locality on parallel machines [21, 22, 27, 31], and algorithms 
that have good temporal locality have been proposed and implemented for the out-of-core sparse linear 
solvers [34]. A mesh update can be viewed as a sparse matrix-dense vector multiplication, and there exist 
upper and lower bounds on the I/O complexity of this primitive [6]. However, these bounds apply to any 
type of matrix, whereas special structure of well-shaped meshes enables more efficient mesh updates. 

Since the mesh-update problem is reminiscent of graph traversal, we briefly summarize a few results in 
external-memory graph traversal. The earliest papers in this area apply to general directed graphs [1, 12, 13, 
29] and others focus on more specialized graphs, such as planar directed graphs [4] or undirected graphs 
perhaps of bounded degree [5, 14, 25, 26, 28]. The problem of cache-oblivious graph traversal and related 
problems is addressed by [3, 10]. There are also external-memory and cache-oblivious algorithms for other 
common graph problems, but such citations are beyond the scope of this paper. 

The problem of cache-oblivious mesh layouts is first described in [40]. This paper gives no theoretical 
guarantees either on the traversal cost or the cost to generate the mesh layout, however. It does propose 
heuristics for mesh layout that give good running times, in practice, for a range of types of mesh traversals. 
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